Introduction

A decision tree is a simple, easily interpretable modeling technique for both regression and classification problems. On their own, decision trees usually do not perform very well compared to other methods. Their relevance lies in the fact that they are the building blocks of two of the most successful ML algorithms to date: random forests and gradient boosted trees. In this chapter, we will introduce these tree-based methods.

Decision Trees

How they work

In our journey to estimate the model $f$ by $\hat f$, we have considered mainly linear functions $f$ so far. We now move to a different function class: decision trees. They were introduced in 1984 by Leo Breiman, Jerome Friedman and others [1] and are sometimes called "Classification and Regression Trees" (CART).

(Binary) decision trees are calculated recursively by partitioning the data into two pieces. Partitions are chosen to optimize the given loss by asking the best "yes/no" question about the covariates, e.g., "is carat < 1?" or "is color better than F?".

Typical loss functions (= "objectives") are the MSE for regression problems and information (= cross-entropy = log loss = logistic deviance) or Gini impurity for classification. The latter can be viewed as a "variance" for categorical variables.
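To make these loss functions concrete, here is a small sketch of Gini impurity and cross-entropy as functions of class proportions (both are minimal for a pure leaf and maximal for a 50/50 split):

```python
import math

def gini_impurity(p):
    """Gini impurity of a vector of class proportions: 1 - sum(p_k^2)."""
    return 1.0 - sum(pk ** 2 for pk in p)

def cross_entropy(p):
    """Information / cross-entropy (natural log), skipping empty classes."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

# A pure leaf has zero impurity; a 50/50 leaf is maximally impure
print(gini_impurity([1.0, 0.0]))   # 0.0
print(gini_impurity([0.5, 0.5]))   # 0.5
print(cross_entropy([0.5, 0.5]))   # ~0.693 = ln(2)
```

A split is chosen to maximize the impurity reduction, i.e., the parent impurity minus the weighted impurities of the two children.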

Predictions are calculated by sending an observation through the tree, starting with the question at the "trunk" and ending in a "leaf". The prediction is the value associated with the leaf. For MSE loss, the leaf values equal the average response of all observations in the leaf. In classification settings, it may be the most frequent class in the leaf or all class probabilities.
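The following minimal sketch (using scikit-learn, which the rest of the chapter also assumes) shows that with MSE loss, each leaf predicts the average response of its observations:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two clearly separated groups; a depth-1 tree ("stump") finds the split
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.0, 1.0, 10.0, 10.0, 10.0])

stump = DecisionTreeRegressor(max_depth=1).fit(X, y)

# Each prediction is the mean response of the leaf the observation falls into
print(stump.predict([[2.0]]))   # [1.]  (mean of the left leaf)
print(stump.predict([[11.0]]))  # [10.] (mean of the right leaf)
```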

The concept of a decision tree is best understood with an example.

Example: decision tree

We will use the dataCar data set to predict the claim probability with a decision tree. As features, we use veh_value, veh_body, veh_age, gender, area and agecat.
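A sketch of such a model could look as follows. Loading the actual dataCar data is not shown here; a small synthetic stand-in with two of the listed features (veh_value, agecat) is used instead so the snippet runs on its own:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-in for dataCar: vehicle value, driver age category, rare claim
veh_value = rng.uniform(0.5, 5.0, n)
agecat = rng.integers(1, 7, n)
clm = rng.binomial(1, 0.1 + 0.05 * (agecat == 1), n)

X = np.column_stack([veh_value, agecat])

# "entropy" = information / cross-entropy split criterion
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, clm)

# predict_proba returns per-leaf class frequencies, i.e. claim probabilities
proba = tree.predict_proba(X)[:, 1]
print(proba.min(), proba.max())
```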

Comments

Properties of decision trees

These properties typically translate 1:1 to combinations of trees like random forests or boosted trees.

Random Forests

How they work

In 2001, Leo Breiman introduced a very powerful tree-based algorithm called random forest, see [2]. A random forest consists of many decision trees. To ensure that the trees differ, two sources of randomness are injected:

  1. Each split scans only a random selection "mtry" of the $m$ covariates to find the best split, usually about $\sqrt{m}$ or $m/3$. "mtry" is the main tuning parameter of a random forest.
  2. Each tree is calculated on a bootstrap sample of the data, i.e., on $n$ observations selected with replacement from the original $n$ rows. This technique is called "bagging", from "bootstrap aggregating".
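Both sources of randomness map directly to parameters of scikit-learn's random forest (shown here as a sketch on synthetic data): `max_features` corresponds to "mtry" and `bootstrap=True` to bagging. The snippet also verifies that the forest prediction is the average of the individual tree predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

# max_features="sqrt" ~ mtry = sqrt(m); bootstrap=True resamples rows per tree
rf = RandomForestRegressor(
    n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0
).fit(X, y)

# The forest prediction equals the average of the 100 tree predictions
tree_preds = np.column_stack([t.predict(X) for t in rf.estimators_])
print(np.allclose(tree_preds.mean(axis=1), rf.predict(X)))  # True
```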

Predictions are found by pooling the predictions of all trees, e.g., by averaging or majority voting.

Comments about random forests

Example: random forest

Let us now fit a random forest for diamond prices with typical parameters and 500 trees. 80% of the data is used for training; the remaining 20% is used to evaluate performance. We use stratified splitting on the response variable. (Throughout the rest of the lecture, we will ignore the problematic aspect of having repeated rows for some diamonds.)
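A sketch of the stratified 80/20 split is shown below. The real diamonds data loading is not reproduced here; a synthetic stand-in is used, and stratification on the continuous response is done via quantile bins (an assumed implementation choice):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in for diamond prices (the real data loading is not shown)
price = rng.lognormal(mean=7, sigma=1, size=2000)
carat = price / 5000 + rng.normal(scale=0.1, size=2000)
df = pd.DataFrame({"carat": carat, "price": price})

# Stratify on quantile bins of the continuous response
bins = pd.qcut(df["price"], q=5, labels=False)
train, test = train_test_split(df, test_size=0.2, stratify=bins, random_state=0)
print(len(train), len(test))  # 1600 400
```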

Comments

Interpreting a "black box"

In contrast to a single decision tree or a linear model, a combination of many trees is not easy to interpret. It is good practice for any ML model to study at least variable importance and the strongest effects, not just its performance. A pure prediction machine is hardly of any interest and might even contain mistakes like using covariates derived from the response. Model interpretation helps to fight such problems and thus also to increase trust in a model.

Variable importance

There are different approaches to measure the importance of a covariate. Since there is no general mathematical definition of "importance", the results of different approaches might be inconsistent with each other. For tree-based methods, a usual approach is to measure how many times a covariate $X$ was used in a split or how much loss improvement came from splitting on $X$.

Approaches that work for any supervised model (including neural nets) include permutation importance and SHAP importance.

Effects

One of the main reasons for the success of modern methods like random forests is the fact that they automatically learn interactions between two or more covariates. Thus, the effect of a covariate $X$ typically depends on the values of other covariates. In the extreme case, the effect of $X$ is different for each observation. The best we can do is to study the average effect of $X$ over many observations, i.e., to average the effects over the interactions. This leads us to partial dependence plots. They work for any supervised ML model and are constructed as follows: a couple of observations are selected. Then, their average prediction is visualized against $X$ while sliding their value of $X$ over a reasonable grid of values, keeping all other variables fixed. The more natural this Ceteris Paribus clause, the more reliable the partial dependence plot.

Remark: A partial dependence plot of a covariate in a linear regression is simply a visualization of its coefficient.
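The construction, and the remark about linear regression, can be sketched directly (using scikit-learn for the model; the partial dependence computation itself is just a loop over a grid):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)
model = LinearRegression().fit(X, y)

def partial_dependence(model, X, feature, grid):
    """Average prediction over all rows while sliding one feature over a grid."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v          # Ceteris Paribus: only this feature moves
        pd_values.append(model.predict(X_mod).mean())
    return np.array(pd_values)

grid = np.linspace(-2, 2, 5)
pdp = partial_dependence(model, X, 0, grid)

# For a linear model, the PDP is a line with slope equal to the coefficient
slope = (pdp[-1] - pdp[0]) / (grid[-1] - grid[0])
print(slope, model.coef_[0])
```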

Alternatives to partial dependence plots include accumulated local effect plots and SHAP dependence plots. Both relax the Ceteris Paribus clause.

Example: random forest (continued)

For our last example, we will now look at variable importance and partial dependence plots.

Comments

Exercises

  1. In the above example, replace carat by its logarithm. Do the results change compared to the example without logarithms?
  2. Fit a random forest on the claims data for the binary variable clm using covariates veh_value, veh_body, veh_age, gender, area, and agecat. Choose a suitable tree depth either by cross-validation or by minimizing OOB error on the training data. Make sure to fit a probability random forest, i.e., predicting probabilities, not classes. Additionally, make sure to work with a relevant loss function (information/cross-entropy or Gini gain). Evaluate the final model on an independent test dataset. Interpret the results by split gain importance and partial dependence plots.

Gradient Boosted Trees

The idea of boosting was introduced by Schapire in 1990 [3] and roughly works as follows: A simple model is fit to the data. Then, another simple model is added, trying to correct the errors from the first model. This process is repeated until some stopping criterion triggers. As simple models or base learners, usually small decision trees are used, an idea pushed by Jerome Friedman in his famous 2001 article on the very general framework of gradient boosting machines [4].
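For MSE loss, "correcting the errors of the previous model" amounts to fitting each new tree to the residuals of the current ensemble. A minimal from-scratch sketch of this idea with small tree base learners (scikit-learn stumps are assumed as base learners):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
pred = np.full_like(y, y.mean())   # start from a constant model
trees = []
for _ in range(100):
    residuals = y - pred           # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)   # correct them a little
    trees.append(tree)

# Training MSE is now far below the initial variance of y
print(np.mean((y - pred) ** 2))
```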

Modern variants of such gradient boosted trees are XGBoost, LightGBM and CatBoost. These are the predominant algorithms in ML competitions on tabular data; see this comparison for the differences (screenshot as of Sept. 25, 2020):

Predictions are calculated similarly to random forests, i.e., by combining the predictions of all trees. As loss/objective function, one can choose among many possibilities. Often, using the same loss as a corresponding GLM is a good choice.

Example: XGBoost (untuned)

As an initial example on gradient boosting and XGBoost, we fit a model for diamond prices with MSE loss. The number of rounds/trees is chosen by early stopping on the validation data, i.e., boosting continues until validation RMSE stops improving for a couple of rounds. The learning rate (the weight of each tree) is chosen by trial and error in order to end up with a number of trees that is neither too small nor too large, see the next section for more details.
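The early stopping mechanism can be sketched as follows. As a stand-in for the XGBoost call (whose exact code is not reproduced here), scikit-learn's GradientBoostingRegressor is used, which supports the same idea via an internal validation set:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=1000)

# 20% internal validation set; stop once the validation score has not
# improved for 10 consecutive rounds
gbm = GradientBoostingRegressor(
    n_estimators=1000, learning_rate=0.1,
    validation_fraction=0.2, n_iter_no_change=10, random_state=0,
).fit(X, y)

print(gbm.n_estimators_)  # rounds actually fitted, typically far below 1000
```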

Comments:

Parameters of gradient boosted trees

Gradient boosted trees offer quite a lot of parameters. Unlike with random forests, they need to be tuned to achieve good results. It would be naive to use an algorithm like XGBoost without parameter tuning. Here is a selection:

Note on tuning: Since the learning rate, the number of boosting rounds and the regularization parameters are heavily interdependent, a "big" randomized grid search CV over all of them at once is often not ideal. The suggestion above (fix the learning rate, select the number of rounds by early stopping, and run the grid search only over the regularization parameters) is more focused, see also the example below.

Note on evaluation metrics: While loss/objective functions are used by the algorithm to fit the model, an evaluation metric is a function used to monitor performance, tune parameters and select models. Ideally, it is consistent with the loss function of the specific algorithm (for instance RMSE as metric and MSE as loss), but in some cases it might differ. For classification, besides studying (multi-class) log loss, one sometimes considers the confusion matrix and its derived measures such as accuracy, precision, recall, F1-score etc. Wikipedia summarizes these concepts. The confusion matrix tabulates the combinations of actual classes and predicted classes. Note that such measures derived from the confusion matrix do not enjoy good statistical properties, and focusing on them might lead to suboptimal models.
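A small sketch of the confusion matrix and its derived measures with scikit-learn on a hand-made example (TP = 3, FP = 2, FN = 1, TN = 4):

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)   # rows: actual, columns: predicted
print(cm)                               # [[4 2]
                                        #  [1 3]]
print(accuracy_score(y_true, y_pred))   # (TP + TN) / n        = 0.7
print(precision_score(y_true, y_pred))  # TP / (TP + FP)       = 0.6
print(recall_score(y_true, y_pred))     # TP / (TP + FN)       = 0.75
```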

Example: XGBoost (tuned)

We will use XGBoost to fit diamond prices using MSE as loss function and evaluation metric, now using a sophisticated tuning strategy.

Overall, the modelling strategy is as follows:

  1. Using default regularization parameters, set the learning rate to get a reasonable number of trees with CV-based early stopping.
  2. Use randomized search CV with early stopping to tune regularization parameters such as tree depth. Iterate if needed.
  3. Use the best parameter combination (incl. number of trees) to fit the model on the training data. "Best" typically means in terms of CV performance. As mentioned in the last chapter and depending on the situation, it can also mean "good CV performance and not too heavy overfit compared to insample performance" or some other reasonable criterion.
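The strategy above can be sketched as follows. Scikit-learn's GradientBoostingRegressor and RandomizedSearchCV are used as a stand-in for the XGBoost-specific code (which is not reproduced here), with a deliberately tiny search space and synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.3, size=400)

# Step 1: learning rate and number of rounds fixed beforehand
# Step 2: randomized search CV over regularization parameters only
param_distributions = {
    "max_depth": [2, 3, 4],
    "subsample": [0.8, 1.0],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(learning_rate=0.1, n_estimators=100, random_state=0),
    param_distributions, n_iter=4, cv=3, random_state=0,
).fit(X, y)

# Step 3: refit on the training data with the best combination (done by default)
print(search.best_params_)
```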

Now, the model is ready to be inspected by evaluating

Comment: The resulting model seems comparable to the random forest, with slightly better performance.

The next code snippet shows how to use XGBoost's scikit-learn interface.

Exercises

  1. Study the online documentation of XGBoost to figure out how to make the model monotonically increasing in carat. Test your insights without rerunning the grid search in our last example, i.e., just by refitting the final model. How does the partial dependence plot for carat look now?
  2. Develop a strong XGBoost model for the claims data set with binary response clm and covariates veh_value, veh_body, veh_age, gender, area, and agecat. Use a clean cross-validation/test approach. Use log loss both as objective and evaluation metric. Interpret its results.
  3. Optional. Study the documentation of LightGBM. Use LightGBM to develop a competitor to the XGBoost claims model from Exercise 2. The XGBoost code needs to be slightly adapted. Compare grid search time.

Chapter Summary

In this chapter, we have met decision trees, random forests and tree boosting. Single decision trees are very easy to interpret but do not perform too well. On the other hand, tree ensembles like the random forest or gradient boosted trees usually perform very well but are tricky to interpret. We have introduced interpretation tools to look into such "black boxes". The main reason why random forests and boosted trees often provide more accurate models than a linear model lies in their ability to automatically learn interactions and other non-linear effects.

Chapter References

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and Regression Trees", Wadsworth, Belmont, CA, 1984.

[2] L. Breiman, "Random forests". In: Machine Learning, 2001, 45(1).

[3] R. Schapire, "The strength of weak learnability". In: Machine Learning, 1990, 5(2).

[4] J. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine". In: The Annals of Statistics, 2001, 29(5).